A Corpus-based Probabilistic Grammar with Only Two Non-terminals
Authors
Abstract
The availability of large, syntactically bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed from the Penn Tree Bank a grammar in which the symbols S and NP are the only real non-terminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be

S -> NP VBX JJ CC VBX NP   (1)

where NP is a non-terminal and the other symbols are terminals (part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is

(S NP (VP (VP VBX (ADJ JJ)) CC (VP VBX NP)))   (2)

So, if our parser uses rule (1) in parsing a sentence, it will generate structure (2) for the corresponding part of the sentence. Using ... of the Penn Tree Bank for training, we extracted ... distinct rules for S and ... for NP. We also built a smaller version of the grammar, based on higher-frequency patterns, for use as a back-up when the larger grammar is unable to produce a parse due to memory limitation. We applied this parser to ... Wall Street Journal sentences, separate from the training set and with no limit on sentence length. Of the parsed sentences, the percentage of no-crossing sentences is ..., and Parseval recall and precision are ... and ....
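To make the flattening idea concrete, here is a minimal Python sketch (not the authors' code) of the extraction step: every S or NP node in a bracketed tree becomes the left-hand side of a rule, its right-hand side is obtained by flattening all other grammatical nodes away so that only part-of-speech tags and embedded S/NP symbols remain, and the full original structure is stored alongside the rule so a parser that uses the rule can restore that structure. The tuple tree representation and helper names (`flatten_rhs`, `extract_rules`, `render`) are assumptions of the sketch, not the paper's implementation.

```python
# Sketch of rule extraction with only two real non-terminals (S and NP).
# Trees are (label, children) tuples; leaves are (POS tag, word) pairs.
from collections import Counter, defaultdict

REAL_NONTERMINALS = {"S", "NP"}

def flatten_rhs(node):
    """Return the flattened right-hand-side symbols under `node`."""
    label, children = node
    symbols = []
    for c_label, c_rest in children:
        if isinstance(c_rest, str):            # leaf: keep the POS tag as a terminal
            symbols.append(c_label)
        elif c_label in REAL_NONTERMINALS:     # embedded S or NP stays a non-terminal
            symbols.append(c_label)
        else:                                  # any other grammatical node is flattened away
            symbols.extend(flatten_rhs((c_label, c_rest)))
    return symbols

def render(node):
    """Pretty-print a tree in bracketed form, dropping the words."""
    label, rest = node
    if isinstance(rest, str):
        return label
    return "(%s %s)" % (label, " ".join(render(c) for c in rest))

def extract_rules(tree, rule_counts=None, structures=None):
    """Collect one rule per S/NP node, plus the full structure it came from."""
    if rule_counts is None:
        rule_counts, structures = Counter(), defaultdict(Counter)
    label, children = tree
    if label in REAL_NONTERMINALS:
        rule = (label, tuple(flatten_rhs(tree)))
        rule_counts[rule] += 1
        structures[rule][render(tree)] += 1    # remember the original structure
    for child in children:
        if not isinstance(child[1], str):      # recurse into internal nodes only
            extract_rules(child, rule_counts, structures)
    return rule_counts, structures

# Toy tree whose shape mirrors the example in the abstract.
tree = ("S", [
    ("NP", [("NN", "sales")]),
    ("VP", [
        ("VP", [("VBX", "were"), ("ADJ", [("JJ", "flat")])]),
        ("CC", "and"),
        ("VP", [("VBX", "hurt"), ("NP", [("NN", "profit")])]),
    ]),
])

counts, structs = extract_rules(tree)
# counts now contains ('S', ('NP', 'VBX', 'JJ', 'CC', 'VBX', 'NP')),
# mapped in structs to its bracketed structure, as in rule (1) above.
```

At parse time the idea is the reverse lookup: when the rule fires, the most frequent stored structure for that rule is emitted as the parse of the corresponding span.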
Similar resources
Studying impressive parameters on the performance of Persian probabilistic context free grammar parser
In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, the Penn Treebank, was published. However, although originating in computational linguistics, the value of tree bank data is becoming more widely appreciated in linguistics research as a whole. F...
A Corpus-based Probabilistic Grammar with Only Two Non-terminals (Satoshi SEKINE)
The availability of large, syntactically-bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free gramm...
Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction
We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM ...
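As a rough illustration of the underlying logistic normal idea (a sketch under stated assumptions, not the paper's shared or partitioned construction), the probabilities of the rules expanding one non-terminal can be drawn by pushing a correlated Gaussian vector through a softmax, so that the covariance matrix expresses which rules tend to rise and fall together:

```python
# Minimal logistic normal draw over rule probabilities (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sample_rule_probs(mean, cov):
    """Draw multinomial rule probabilities from a logistic normal distribution."""
    eta = rng.multivariate_normal(mean, cov)   # correlated Gaussian draw
    exp_eta = np.exp(eta - eta.max())          # softmax with overflow guard
    return exp_eta / exp_eta.sum()

# Three rules expanding the same non-terminal; the off-diagonal term ties
# the first two rules together, a crude stand-in for soft parameter tying.
mean = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(sample_rule_probs(mean, cov))
```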
Learning Common Grammar from Multilingual Corpus
We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language-dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that i...
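A toy sketch of that generative story (the grammar skeleton, lexicon, and Dirichlet prior below are illustrative assumptions, not the paper's model): a shared prior produces one set of PCFG parameters per language, and each language's sentences are then generated top-down from its own PCFG.

```python
# Toy generative model: shared prior -> per-language PCFG -> sentences.
import numpy as np

rng = np.random.default_rng(1)

# A tiny grammar skeleton shared across languages (hypothetical).
RULES = {
    "S":  [("NP", "VP")],
    "NP": [("DET", "N"), ("N",)],
    "VP": [("V", "NP"), ("V",)],
}
LEXICON = {"DET": ["the"], "N": ["dog", "cat"], "V": ["sees", "sleeps"]}

def sample_pcfg(prior_counts=1.0):
    """Draw per-language rule probabilities from a shared Dirichlet prior."""
    return {lhs: rng.dirichlet([prior_counts] * len(rhss))
            for lhs, rhss in RULES.items()}

def generate(symbol, probs):
    """Expand `symbol` top-down using one language's PCFG parameters."""
    if symbol in LEXICON:                       # pre-terminal: emit a word
        return [rng.choice(LEXICON[symbol])]
    rhss = RULES[symbol]
    chosen = rhss[rng.choice(len(rhss), p=probs[symbol])]
    return [w for child in chosen for w in generate(child, probs)]

for lang in ("lang_A", "lang_B"):
    pcfg = sample_pcfg()
    print(lang, " ".join(generate("S", pcfg)))
```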
Deep non-probabilistic parsing of large corpora
This paper reports a large-scale non-probabilistic parsing experiment with a deep LFG parser. We briefly introduce the parser we used, named SXLFG, and the resources that were used together with it. Then we report quantitative results about the parsing of a multi-million word journalistic corpus. We show that we can parse more than 6 million words in less than 12 hours, only 6.7% of all sentenc...
Publication date: 2003